1 Imaging Data

1.1 Loading and Initial Exploration of the 4D MRI Data

We’ll begin by loading the 4D MRI data from the provided URL.

# install.packages(c("oro.nifti", "dplyr", "plotly"))
library(oro.nifti)
## oro.nifti 0.11.4
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:oro.nifti':
## 
##     slice
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:oro.nifti':
## 
##     slice
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Download and read the 4D fMRI data
fMRIURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/fMRI_FilteredData_4D.nii.gz"
fMRIFile <- file.path(tempdir(), "fMRI_FilteredData_4D.nii.gz")
download.file(fMRIURL, destfile=fMRIFile, quiet=TRUE)
fMRIVolume <- readNIfTI(fMRIFile, reorient=FALSE)

# Check the dimensions of the data
fMRIVolDims <- dim(fMRIVolume)
fMRIVolDims
## [1]  64  64  21 180
# Plot the 4D array of imaging data
image(fMRIVolume, zlim=range(fMRIVolume)*0.95)

Distribution of Voxel Intensities

Let's visualize the distribution of voxel intensities across the entire dataset.

hist(fMRIVolume, main="Histogram of 4D MRI Data", col="lightblue", border="black")

# Plot an orthographic display of the fMRI data
orthographic(fMRIVolume, xyz=c(34,29,10), zlim=range(fMRIVolume)*0.9)

stat_fmri_test <- ifelse(fMRIVolume > 15000, fMRIVolume, NA)

h <- hist(stat_fmri_test, plot = FALSE)

plot_ly(x = h$mids, y = h$density, type = "bar") %>%
  layout(title = "fMRI Histogram (high intensities)")
dim(stat_fmri_test)
## [1]  64  64  21 180
# Overlay the thresholded (high-intensity) map from time point 5 on the volume
overlay(fMRIVolume, stat_fmri_test[,,,5], zlim.x=range(fMRIVolume)*0.95)

Time-Series Analysis of Voxels

We'll extract and visualize the time series of a reference voxel, a nearby voxel, and a distant voxel.

# Reference
plot(fMRIVolume[30, 30, 10,], type='l', main="Time Series of 3D Reference Voxel (x=30, y=30, z=10)", col="blue")

# Applying smoothing
x1 <- 1:180
lines(x1, smooth(fMRIVolume[30, 30, 10,]), col="red", lwd=2)
lines(ksmooth(x1, fMRIVolume[30, 30, 10,], kernel="normal", bandwidth=5), col="green", lwd=3)
legend("bottomright", legend=c("Raw fMRI", "Median-smoothed fMRI", "Kernel-smoothed fMRI"), col=c("blue", "red", "green"), lty=1, lwd=2)

# Near-by Voxels
plot(fMRIVolume[31, 31, 10,], type='l', main="Time Series of 3D Nearby Voxel (x=31, y=31, z=10)", col="blue")

# Applying smoothing
lines(x1, smooth(fMRIVolume[31, 31, 10,]), col="red", lwd=2)
lines(ksmooth(x1, fMRIVolume[31, 31, 10,], kernel="normal", bandwidth=5), col="green", lwd=3)
legend("bottomright", legend=c("Raw fMRI", "Median-smoothed fMRI", "Kernel-smoothed fMRI"), col=c("blue", "red", "green"), lty=1, lwd=2)

#Distant Voxels
plot(fMRIVolume[40, 40, 15,], type='l', main="Time Series of 3D Distant Voxel (x=40, y=40, z=15)", col="blue")

# Applying smoothing
lines(x1, smooth(fMRIVolume[40, 40, 15,]), col="red", lwd=2)
lines(ksmooth(x1, fMRIVolume[40, 40, 15,], kernel="normal", bandwidth=5), col="green", lwd=3)
legend("bottomright", legend=c("Raw fMRI", "Median-smoothed fMRI", "Kernel-smoothed fMRI"), col=c("blue", "red", "green"), lty=1, lwd=2)

1.1.1 Reference Voxel

The time series from this voxel fluctuates between approximately 13888 and 14157. This pattern suggests some regularity in the signal, which may reflect specific brain activity or physiological mechanisms.

Running-median smoothing (via smooth()) yields a curve that captures the general trend of the data without the high-frequency noise. Kernel smoothing also captures the overall trend, producing a similar but slightly different curve.

1.1.2 Nearby Voxel

Being spatially close to the reference voxel, this time series exhibits a similar fluctuating pattern, with values ranging roughly from 12955 to 13588. Although the range of values differs slightly from the reference voxel's, the overall pattern is similar. This can be attributed to the spatial autocorrelation inherent in MRI data: nearby voxels tend to have similar signal patterns.

1.1.3 Distant Voxel

The signal from this voxel varies between approximately 11351 and 11644. Its range of fluctuation differs from both the reference and nearby voxels, indicating a distinct pattern. Being located farther away, the distant voxel's signal may reflect different brain regions or activities than the first two.

1.2 Correlation Analysis of 3D Spatial Locations in MRI Data

voxel1 <- fMRIVolume[30, 30, 10,]
voxel2 <- fMRIVolume[31, 31, 10,]
voxel3 <- fMRIVolume[40, 40, 15,]

cor1_2 <- cor(voxel1, voxel2)
cor1_3 <- cor(voxel1, voxel3)

# Reference vs Near-by Voxels
cor1_2
## [1] 0.677499
# Reference vs Distant Voxels
cor1_3
## [1] 0.1363766

1.2.1 Correlation between Reference and Nearby Voxels (cor1_2):

The correlation coefficient between these two voxels is 0.677499, indicating a moderately strong positive linear relationship between the reference and nearby time series. Given their spatial proximity, this matches expectations and reinforces the idea of spatial autocorrelation in MRI data: nearby voxels tend to exhibit similar signal fluctuations driven by shared brain activity or physiological mechanisms.

1.2.2 Correlation between Reference and Distant Voxels (cor1_3):

The correlation coefficient between these two voxels is 0.1363766, indicating only a weak positive linear relationship. The much weaker correlation, compared with the nearby voxel, suggests that the distant voxel is influenced by different brain regions or activities than the reference voxel.
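The distance dependence of these correlations can be probed more systematically. The sketch below (assuming `fMRIVolume` is the 4D array loaded earlier) correlates the reference voxel with voxels at increasing diagonal offsets; the specific offsets are illustrative choices, not part of the original analysis.

```r
# Correlation between the reference voxel and voxels at increasing
# diagonal offsets (illustrative offsets; assumes fMRIVolume is loaded)
ref <- fMRIVolume[30, 30, 10, ]
offsets <- c(1, 2, 5, 10)
cors <- sapply(offsets, function(d) {
  cor(ref, fMRIVolume[30 + d, 30 + d, 10, ])
})
data.frame(offset = offsets, correlation = round(cors, 3))
```

Under spatial autocorrelation we would generally expect the correlation to decay as the offset grows, consistent with the cor1_2 versus cor1_3 comparison above.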

2 Tabular data

2.1 Extracting the Data

We will extract the Google Web-Search Trends and Stock Market Data from the given URL using the rvest package.

library(rvest)
library(ggplot2)
library(TTR)
library(forecast)

# The URL containing the table
url <- "https://wiki.socr.umich.edu/index.php/SOCR_Data_GoogleTrends_2005_2011"

# Extract the tables from the web page
web_data <- url %>%
  read_html() %>%
  html_table(fill = TRUE)

# Assuming the table we need is the first one
google_trends_data <- web_data[[1]]

# Setting the first row as the column names
colnames(google_trends_data) <- google_trends_data[1, ]

# Removing the first row now
google_trends_data <- google_trends_data[-1, ]

# View the adjusted data
head(google_trends_data)
## # A tibble: 6 × 26
##   Index Date   Unemployment Rental RealEstate Mortgage Jobs  Investing DJI_Index
##   <chr> <chr>  <chr>        <chr>  <chr>      <chr>    <chr> <chr>     <chr>    
## 1 1     2005-… 1.36         0.87   0.93       1.18     1.08  1.04      10729.43 
## 2 2     2005-… 1.38         0.89   0.96       1.22     1.12  1.06      10729.43 
## 3 3     2005-… 1.41         0.91   0.98       1.24     1.17  1.08      10729.43 
## 4 4     2005-… 1.44         0.93   0.99       1.25     1.23  1.08      10630.78 
## 5 5     2005-… 1.45         0.94   0.99       1.25     1.28  1.08      10597.83 
## 6 6     2005-… 1.48         0.96   0.99       1.26     1.34  1.09      10622.88 
## # ℹ 17 more variables: StdDJI <chr>, Unemployment30MA <chr>, Rental30MA <chr>,
## #   RealEstate30MA <chr>, Mortgage30MA <chr>, Jobs30MA <chr>,
## #   Investing30MA <chr>, DJI_Index30MA <chr>, StdDJI_30MA <chr>,
## #   Unemployment180MA <chr>, Rental180MA <chr>, RealEstate180MA <chr>,
## #   Mortgage180MA <chr>, Jobs180MA <chr>, Investing180MA <chr>,
## #   DJI_Index180MA <chr>, StdDJI_180MA <chr>

2.2 Time Series Plot for “Jobs”

Plotting the time series data for the variable "Jobs".

library(ggplot2)

# Convert Date column to Date type
google_trends_data$Date <- as.Date(google_trends_data$Date, format = "%Y-%m-%d")

# Convert the Jobs column to numeric (after removing any commas)
google_trends_data$Jobs <- as.numeric(gsub(",", "", google_trends_data$Jobs))
google_trends_data$Jobs <- ts(google_trends_data$Jobs, frequency = 365)

# Plotting
ggplot(google_trends_data, aes(x = Date, y = Jobs)) +
  geom_line() +
  labs(title = "Time Series for Jobs", x = "Date", y = "Jobs Index")

Applying TTR for Monthly Smoothing

Using the TTR package, we apply a simple moving average for monthly smoothing.

library(TTR)

# Apply monthly simple moving average
google_trends_data$Jobs_SMA <- SMA(google_trends_data$Jobs, n = 30) # Using 30 days for monthly average

# Compute the Exponential Moving Average for extra reference
google_trends_data$Jobs_EMA <- EMA(google_trends_data$Jobs, n = 30) # Using 30 days for monthly average

# Visualization
ggplot(google_trends_data, aes(x = Date)) +
  geom_line(aes(y = Jobs, color = "Observed"), alpha = 0.5) +
  geom_line(aes(y = Jobs_SMA, color = "SMA")) +
  geom_line(aes(y = Jobs_EMA, color = "EMA")) +
  labs(title = "Google Trends 'Jobs' Data with SMA & EMA Smoothing", x = "Date", y = "Jobs Index") +
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5))  +
  scale_color_manual(values = c("Observed" = "black", "SMA" = "red", "EMA" = "blue"),
                     breaks = c("Observed", "SMA", "EMA"))

Differencing Parameter

Determine the appropriate differencing parameter for the ARIMA model.

library(forecast)

# Determine the differencing parameter for the "Job" time series
# Using ADF test

d <- ndiffs(google_trends_data$Jobs)
d
## [1] 1
# First and second differences of the Jobs series
google_trends_data$bj.diff <- c(NA, diff(google_trends_data$Jobs, differences = 1))
google_trends_data$bj.diff2 <- c(NA, NA, diff(google_trends_data$Jobs, differences = 2))

# Visualization using ggplot2
ggplot(google_trends_data, aes(x = Date)) +
  geom_line(aes(y = bj.diff, color = "1st diff"), na.rm = TRUE) +
  geom_line(aes(y = bj.diff2, color = "2nd diff"), na.rm = TRUE) +
  labs(title = "Google Trends 'Jobs' Data: First & Second Differencing", x = "Date", y = "Jobs Index") +
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("1st diff" = "red", "2nd diff" = "blue"))

dsma <- ndiffs(google_trends_data$Jobs_SMA)
dsma
## [1] 1
# Calculate the first and second differences of the SMA
google_trends_data$SMA_diff1 <- c(NA, diff(google_trends_data$Jobs_SMA, differences = 1))
google_trends_data$SMA_diff2 <- c(NA, NA, diff(google_trends_data$Jobs_SMA, differences = 2))

# Visualization using ggplot2
ggplot(google_trends_data, aes(x = Date)) +
  geom_line(aes(y = SMA_diff1, color = "1st diff of SMA"), na.rm = TRUE) +
  geom_line(aes(y = SMA_diff2, color = "2nd diff of SMA"), na.rm = TRUE) +
  labs(title = "Google Trends 'Jobs' Data: SMA First & Second Differencing", x = "Date", y = "Jobs Index") +
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("1st diff of SMA" = "red", "2nd diff of SMA" = "blue"))

dema <- ndiffs(google_trends_data$Jobs_EMA)
dema
## [1] 1
# Calculate the first and second differences of EMA
google_trends_data$EMA_diff1 <- c(NA, diff(google_trends_data$Jobs_EMA, differences = 1))
google_trends_data$EMA_diff2 <- c(NA, NA, diff(google_trends_data$Jobs_EMA, differences = 2))

# Visualization using ggplot2
ggplot(google_trends_data, aes(x = Date)) +
  geom_line(aes(y = EMA_diff1, color = "1st diff of EMA"), na.rm = TRUE) +
  geom_line(aes(y = EMA_diff2, color = "2nd diff of EMA"), na.rm = TRUE) +
  labs(title = "Google Trends 'Jobs' Data: EMA First & Second Differencing", x = "Date", y = "Jobs Index") +
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        plot.title.position = "plot",
        plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("1st diff of EMA" = "red", "2nd diff of EMA" = "blue"))

The ndiffs() function recommends a differencing parameter of 1 for the original “Jobs” series, its monthly simple moving average (SMA), and its exponential moving average (EMA). This indicates that all three series—whether raw or smoothed—become stationary after a single differencing. Plots of the first and second differences visually confirm this.

Given the comparable stationarity of the SMA and EMA first-differenced series, the choice hinges on the dataset's specifics and the goal of the analysis. The SMA offers a balanced representation, capturing underlying trends and reducing noise without emphasizing recent observations. The EMA, in contrast, emphasizes recent data points, while the original series provides an unaltered, if noisier, view.

Despite the inherent noise in the original dataset, it offers an authentic representation of observed values, capturing the full essence, including short-term fluctuations and sudden spikes. Utilizing smoothed data might obscure these nuances, potentially missing out on vital short-term effects. Moreover, the original data keeps the analysis straightforward, avoids added complexities of transformations, and ensures no pivotal information is discarded. Thus, for a more genuine and comprehensive insight, the decision was made to proceed with the original data for further analysis.

AR and MA Parameters

library(tseries)
# Remove NAs from the first-differenced series
bj.diff_no_na <- na.omit(google_trends_data$bj.diff)

# Compute ACF and PACF values
acf_vals <- acf(bj.diff_no_na, plot = FALSE)
pacf_vals <- pacf(bj.diff_no_na, plot = FALSE)

# 95% CI
ci <- qnorm((1 + 0.95)/2)/sqrt(sum(!is.na(bj.diff_no_na)))

# ACF Plot
ggplot(data.frame(lag = acf_vals$lag[-1], acf = acf_vals$acf[-1]), aes(x = lag, y = acf)) +
  geom_hline(yintercept = c(ci, -ci), linetype = "dashed") +
  geom_bar(stat="identity", fill = "steelblue") +
  labs(title = "ACF Plot for First Differenced Series", x = "Lag", y = "Autocorrelation") +
  theme_minimal()

ACF (Autocorrelation Function) Observations

The ACF of the differenced data starts with a strong positive autocorrelation of 0.601 at lag 1, suggesting that the series at a given time is closely related to its previous observation. The ACF decays slowly, hinting at a potential MA process. Given the 95% confidence interval (CI) of approximately 0.0401, the first few lags are particularly significant, as they lie outside this interval.

# PACF Plot
df_pacf <- data.frame(lag = 1:(length(pacf_vals$acf) - 1), pacf = pacf_vals$acf[-1])

ggplot(df_pacf, aes(x = lag, y = pacf)) +
  geom_hline(yintercept = c(ci, -ci), linetype = "dashed") +
  geom_bar(stat="identity", fill = "darkred", position = position_dodge(width=0.9)) +
  labs(title = "PACF Plot for First Differenced Series", x = "Lag", y = "Partial Autocorrelation") +
  theme_minimal()

2.3 PACF (Partial Autocorrelation Function) Observations:

The PACF displays a strong spike of 0.601 at lag 1, suggesting an AR(1) process. This indicates that the series at a given time is influenced by its immediate previous observation once the effects of other lags are accounted for. After lag 1, PACF values decrease in magnitude and oscillate around zero, without any clear patterns of significant spikes.

2.4 Interpretation and ARIMA Model Determination:

The pronounced spike at lag 1 in PACF indicates an AR term of order 1, i.e. p=1. The slowly decaying ACF suggests the possibility of an MA process. Given the significance at the initial lags, we might consider MA terms in that range. Even though the ACF and PACF may indicate significance at certain lags, it’s crucial to remember that ARIMA models can start from 0 for the p (AR) and q (MA) terms. Sometimes, simpler models can be more interpretable and effective.

Considering the above observations, potential ARIMA models to consider include:

ARIMA(0,1,0) ARIMA(1,1,0) ARIMA(0,1,1) ARIMA(1,1,1)

However, remember that these are just starting points. Model selection should be based on diagnostic checks of the residuals, criteria such as the AIC or BIC, and out-of-sample forecast performance. Eventually, we will use auto.arima to help find optimal estimates for the remaining ARIMA parameters, p and q.
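As a quick sanity check before running auto.arima, the four candidate models listed above can be fitted directly and compared by information criteria. This is a sketch, assuming `google_trends_data$Jobs` from the earlier chunks:

```r
library(forecast)

# Fit each candidate non-seasonal ARIMA and compare AIC/BIC
orders <- list(c(0, 1, 0), c(1, 1, 0), c(0, 1, 1), c(1, 1, 1))
fits <- lapply(orders, function(o) Arima(google_trends_data$Jobs, order = o))
data.frame(
  model = sapply(orders, function(o) paste0("ARIMA(", paste(o, collapse = ","), ")")),
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC)
)
```

Lower AIC/BIC values indicate a better trade-off between fit and parsimony; the comparison complements, rather than replaces, residual diagnostics.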

2.4.1 Augmented Dickey-Fuller (ADF) Test Results:

The Augmented Dickey-Fuller (ADF) test is a unit root test that determines how strongly a time series is defined by a trend. It is commonly used to assess whether a given time series is stationary. Stationarity is an important assumption in time series forecasting, as many predictive models require the time series to be stationary.

The null hypothesis of the ADF test is that a unit root is present in the time series sample, meaning the time series is non-stationary. The alternative hypothesis, on the other hand, is that there is no unit root, which indicates stationarity.

library(tseries)
# Applying Augmented Dickey-Fuller test
adf.test(bj.diff_no_na, alternative = "stationary", k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bj.diff_no_na
## Dickey-Fuller = -24.399, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

The p-value for the ADF test on the first-differenced series is 0.01. This means that we can reject the null hypothesis at the 5% significance level, suggesting that the differenced series is stationary.
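A complementary check is the KPSS test, whose null hypothesis is the reverse of the ADF's: the KPSS null is stationarity, so a large p-value there supports the same conclusion. A minimal sketch, assuming `bj.diff_no_na` from the earlier chunk:

```r
library(tseries)

# KPSS test: null = level-stationary, so (unlike the ADF test above)
# a LARGE p-value here is consistent with stationarity
kpss.test(bj.diff_no_na, null = "Level")
```

Agreement between the ADF and KPSS tests strengthens the stationarity conclusion; disagreement would suggest borderline behavior worth investigating further.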

Decomposition of Time Series

Time series decomposition involves splitting a time series into several components, each representing an underlying pattern category: trend, seasonality, and remainder (often referred to as the noise or irregular component).

count_ma <- ts(bj.diff_no_na, frequency = 365)
decomp <- stl(count_ma, s.window = "periodic")
deseasonal_count <- forecast::seasadj(decomp)

# Build decomp_df from google_trends_data
ts_data <- ts(google_trends_data$Jobs, frequency = 12)  # assuming a 12-observation seasonal cycle
decomp_obj <- stl(ts_data, s.window = "periodic")

decomp_df <- data.frame(
  Date = google_trends_data$Date,
  Seasonal = as.vector(decomp_obj$time.series[, "seasonal"]),
  Trend = as.vector(decomp_obj$time.series[, "trend"]),
  Remainder = as.vector(decomp_obj$time.series[, "remainder"])
)

library(ggplot2)
library(gridExtra)

# Plotting original data
p1 <- ggplot(google_trends_data, aes(x = Date, y = Jobs)) +
  geom_line(color = "black") +
  labs(title = "Original Data")

# Plotting Seasonal component
p2 <- ggplot(decomp_df, aes(x = Date, y = Seasonal)) +
  geom_line(color = "blue") +
  labs(title = "Seasonal Component")

2.4.2 Seasonal Component:

It captures the regular pattern that repeats at known intervals. The “blue” line in the second plot indicates the seasonal fluctuations in the ‘Jobs’ data. If this pattern repeats annually, it could be linked to recurring events that happen at the same time every year, like holiday seasons or annual sales events.

# Plotting Trend component
p3 <- ggplot(decomp_df, aes(x = Date, y = Trend)) +
  geom_line(color = "green") +
  labs(title = "Trend Component")

2.4.3 Trend Component:

It represents the underlying trend in the data if we were to smooth out the seasonal and irregular fluctuations. The “green” line in the third plot reveals the long-term trajectory of the ‘Jobs’ data, giving an insight into whether it’s generally increasing, decreasing, or stable.

# Plotting Remainder component
p4 <- ggplot(decomp_df, aes(x = Date, y = Remainder)) +
  geom_line(color = "red") +
  labs(title = "Remainder Component")

# Combine the plots
grid.arrange(p1, p2, p3, p4, ncol = 1)

2.4.4 Remainder Component:

This component captures the random fluctuations in the data after removing the trend and seasonal components. These fluctuations can be due to random events or noise. The “red” line in the fourth plot displays these irregularities.

tseries::adf.test(count_ma, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  count_ma
## Dickey-Fuller = -19.007, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary

The ‘Jobs’ time series displays clear seasonal patterns, an underlying trend, and sporadic irregularities. The Augmented Dickey-Fuller (ADF) test, with a significant Dickey-Fuller value of -19.007 and a p-value of 0.01, confirms the series’ stationarity. This stationarity is pivotal for numerous forecasting methods that necessitate stationary input data.

Forecast for 2012

Building an ARIMA model and forecasting the trend for the next 365 days (for 2012).

library(forecast)
library(ggplot2)
library(plotly)


# Fit ARIMA model
model <- auto.arima(google_trends_data$Jobs, start.p = 1, start.q = 1, seasonal = TRUE)
model
## Series: google_trends_data$Jobs 
## ARIMA(1,1,0)(0,1,0)[365] 
## 
## Coefficients:
##          ar1
##       0.4473
## s.e.  0.0199
## 
## sigma^2 = 0.0005895:  log likelihood = 4654.94
## AIC=-9305.89   AICc=-9305.88   BIC=-9294.66
# Forecast for the next 365 days
forecast_result <- forecast(model, h = 365)

# Create a new date sequence for the forecasted data
last_date <- max(google_trends_data$Date)
forecast_dates <- seq.Date(from = last_date + 1, by = "days", length.out = 365)

# Combine the forecasted data with the date sequence
forecast_data <- data.frame(
  Date = forecast_dates,
  Forecast = forecast_result$mean,
  Lower_80 = forecast_result$lower[, 1],
  Upper_80 = forecast_result$upper[, 1],
  Lower_95 = forecast_result$lower[, 2],
  Upper_95 = forecast_result$upper[, 2]
)

# Plot the original and forecasted data
forecast_plot <- ggplot() +
  geom_line(data = google_trends_data, aes(x = Date, y = Jobs), color = "black") +
  geom_line(data = forecast_data, aes(x = Date, y = Forecast), color = "blue") +
  geom_ribbon(data = forecast_data, aes(x = Date, ymin = Lower_80, ymax = Upper_80), fill = "blue", alpha = 0.2) +
  geom_ribbon(data = forecast_data, aes(x = Date, ymin = Lower_95, ymax = Upper_95), fill = "blue", alpha = 0.1) +
  labs(title = "SARIMA Forecast for Google Trends 'Jobs' Data in 2012 (ARIMA(1,1,0)(0,1,0)[365])",
       x = "Date",
       y = "Jobs Index") +
  theme(plot.title = element_text(hjust = 0.5))

# Convert ggplot to plot_ly
forecast_plotly <- ggplotly(forecast_plot)

# Display the plot
forecast_plotly

The ‘Jobs’ data is represented by the ARIMA(1,1,0)(0,1,0)[365] model, highlighting a significant influence from its preceding value. This model, which incorporates a first-order autoregressive term and a differencing step for stationarity, appears to be a good fit for the data, as evidenced by its low variance and optimal AIC/BIC values.

The ARIMA(1,1,0)(0,1,0)[365] model indicates that the "Jobs" series in the Google Trends data is primarily influenced by its immediate past observation, as captured by the AR term of order 1. This aligns with the significant spike at lag 1 observed in both the ACF and PACF plots. The model contains no MA terms, but it does apply a seasonal difference at lag 365 to account for annual periodicity. Goodness-of-fit is assessed with the AIC, AICc, and BIC, where lower values are preferred for model comparison: here the AIC is -9305.89 and the BIC is -9294.66.

2.5 Residual Checking

We can further examine the ACF and PACF plots of the residuals to assess model quality. When the model order and structure are correctly specified, we expect no significant autocorrelations in the residual plots.

acf1 <- Acf(residuals(model), plot = FALSE)
pacf1 <- Pacf(residuals(model), plot = FALSE)

plot_ly(x = ~acf1$lag[-1,1,1], y = ~acf1$acf[-1,1,1], type = "bar", name="ACF") %>%  # ACF
  # PACF
  add_trace(x = ~pacf1$lag[-1,1,1], y = ~pacf1$acf[-1,1,1], type = "bar", name="PACF") %>%
  # add CI lines
  add_lines(x=~c(acf1$lag[2,1,1], acf1$lag[length(acf1$lag),1,1]), y=~c(ci, ci), 
            mode="lines", name="Upper CI", line=list(dash='dash')) %>%
  add_lines(x=~c(acf1$lag[2,1,1], acf1$lag[length(acf1$lag),1,1]), y=~c(-ci, -ci), 
            mode="lines", name="Lower CI", line=list(dash='dash')) %>%
  layout(bargap=0.8, title="ACF & PACF Plot - ARIMA Model Residuals",
         xaxis=list(title="lag"), yaxis=list(title="ACF/PACF (Residuals) - ARIMA(1,1,0)(0,1,0)[365]"),
         hovermode = "x unified", legend = list(orientation='h'))
displayForecastErrors <- function(forecastErrors) {
  # Generate a histogram of the forecast errors
  binsize <- IQR(forecastErrors) / 4
  sd  <- sd(forecastErrors)
  min <- min(forecastErrors) - sd
  max <- max(forecastErrors) + sd

  # Generate 5K normal(0, sd) RVs to widen the range if needed
  norm <- rnorm(5000, mean = 0, sd = sd)
  min2 <- min(norm)
  max2 <- max(norm)
  if (min2 < min) { min <- min2 }
  if (max2 > max) { max <- max2 }

  bins <- seq(min, max, binsize)
  histForecast <- hist(forecastErrors, breaks = bins, plot = FALSE)
  histNormal <- hist(norm, breaks = bins, plot = FALSE)

  # Overlay the theoretical normal density on the forecast-error histogram
  plot_ly(x = histForecast$mids, y = histForecast$density, type = "bar",
          name = "Forecast-Errors Histogram") %>%
    add_lines(x = histNormal$mids, y = dnorm(histNormal$mids, 0, sd),
              type = "scatter", mode = "lines",
              name = "(Theoretical) Normal Density") %>%
    layout(bargap = 0.1,
           title = "Histogram of ARIMA(1,1,0)(0,1,0)[365] Model Residuals",
           legend = list(orientation = 'h'))
}

displayForecastErrors(residuals(model))

Based on the provided partial autocorrelations of the residuals from the model (up to a lag of 730), it is observed that many of the values are close to zero and thus may not be statistically significant. While there are some lags with more noticeable autocorrelations, the majority do not indicate strong patterns. This suggests that the residuals are largely white noise, which is a desired property in time series modeling.
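A formal complement to the visual ACF/PACF inspection is the Ljung-Box portmanteau test on the residuals, where a nonsignificant p-value is consistent with white noise. A minimal sketch (the lag choice of 20 is an illustrative convention, not prescribed by the analysis above):

```r
# Ljung-Box test on the ARIMA residuals;
# fitdf = 1 accounts for the single estimated AR coefficient
Box.test(residuals(model), lag = 20, type = "Ljung-Box", fitdf = 1)
```

If the p-value is large, we fail to reject the null of no residual autocorrelation, supporting the white-noise impression from the plots.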

3 Latent variables model

3.1 Data Exploration and Correlation Evaluation

library(readr)
data <- read_csv("I:/UBD PB/ZH5103 Predictive/Longitudinal/16_HandwrittenEnglish_Letters.csv")

# Preview the data
head(data)
## # A tibble: 6 × 17
##   letter  xbox  ybox width height onpix  xbar  ybar x2bar y2bar xybar x2ybar
##   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 T          2     8     3      5     1     8    13     0     6     6     10
## 2 I          5    12     3      7     2    10     5     5     4    13      3
## 3 D          4    11     6      8     6    10     6     2     6    10      3
## 4 N          7    11     6      6     3     5     9     4     6     4      4
## 5 G          2     1     3      1     1     8     6     6     6     6      5
## 6 S          4    11     5      8     3     8     8     6     9     5      6
## # ℹ 5 more variables: xy2bar <dbl>, xedge <dbl>, xedgey <dbl>, yedge <dbl>,
## #   yedgex <dbl>
# Checking summary statistics
summary(data)
##     letter               xbox             ybox            width       
##  Length:20000       Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 3.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  Mode  :character   Median : 4.000   Median : 7.000   Median : 5.000  
##                     Mean   : 4.024   Mean   : 7.035   Mean   : 5.122  
##                     3rd Qu.: 5.000   3rd Qu.: 9.000   3rd Qu.: 6.000  
##                     Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      height           onpix             xbar             ybar     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 4.000   1st Qu.: 2.000   1st Qu.: 6.000   1st Qu.: 6.0  
##  Median : 6.000   Median : 3.000   Median : 7.000   Median : 7.0  
##  Mean   : 5.372   Mean   : 3.506   Mean   : 6.898   Mean   : 7.5  
##  3rd Qu.: 7.000   3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 9.0  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.0  
##      x2bar            y2bar            xybar            x2ybar      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 7.000   1st Qu.: 5.000  
##  Median : 4.000   Median : 5.000   Median : 8.000   Median : 6.000  
##  Mean   : 4.629   Mean   : 5.179   Mean   : 8.282   Mean   : 6.454  
##  3rd Qu.: 6.000   3rd Qu.: 7.000   3rd Qu.:10.000   3rd Qu.: 8.000  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      xy2bar           xedge            xedgey           yedge       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 7.000   1st Qu.: 1.000   1st Qu.: 8.000   1st Qu.: 2.000  
##  Median : 8.000   Median : 3.000   Median : 8.000   Median : 3.000  
##  Mean   : 7.929   Mean   : 3.046   Mean   : 8.339   Mean   : 3.692  
##  3rd Qu.: 9.000   3rd Qu.: 4.000   3rd Qu.: 9.000   3rd Qu.: 5.000  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      yedgex      
##  Min.   : 0.000  
##  1st Qu.: 7.000  
##  Median : 8.000  
##  Mean   : 7.801  
##  3rd Qu.: 9.000  
##  Max.   :15.000
# Plotting correlations
library(corrplot)
# Only select numeric columns from the data
numeric_data <- data[, -c(1)]

# Now compute the correlations
correlations <- cor(numeric_data, use = "complete.obs")
corrplot(correlations, method = "circle")

Justification for Latent Variable Model

Given the correlations observed, it’s likely that some variables are influenced by underlying unobserved (latent) factors. For instance, certain patterns in handwritten letters may be driven by inherent traits or styles of individuals that are not directly observable.

Data Conversion and Scaling

Before applying a structural equation model (SEM), it’s crucial to scale the data to ensure variables are on comparable scales.

scaled_data <- scale(data[, -c(1)])
scaled_data <- data.frame(letter = data$letter, scaled_data)
# Binary Classification
data$letterA <- ifelse(scaled_data$letter == 'A', 1, 0)
# Add ID
data$id <- seq(1, nrow(data))

Fitting SEM

We’ll use the lavaan package to fit an SEM.

library(lavaan)

# Defining a hypothetical model with latent variables
model <- '
letterA ~ xbox+ybox+width+height+onpix+xbar+ybar+x2bar+y2bar+xybar+x2ybar+xy2bar+xedge+xedgey+yedge+yedgex
'
fit <- sem(model, data = data)
summary(fit)
## lavaan 0.6.16 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        17
## 
##   Number of observations                         20000
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   letterA ~                                           
##     xbox             -0.023    0.001  -15.910    0.000
##     ybox              0.003    0.001    4.028    0.000
##     width             0.045    0.001   31.452    0.000
##     height            0.004    0.001    4.109    0.000
##     onpix            -0.030    0.001  -25.927    0.000
##     xbar              0.004    0.001    5.956    0.000
##     ybar             -0.019    0.001  -26.794    0.000
##     x2bar            -0.013    0.000  -27.721    0.000
##     y2bar            -0.040    0.001  -67.615    0.000
##     xybar            -0.006    0.001  -10.558    0.000
##     x2ybar           -0.013    0.001  -20.589    0.000
##     xy2bar            0.008    0.001   13.599    0.000
##     xedge            -0.016    0.001  -19.367    0.000
##     xedgey           -0.010    0.001  -10.993    0.000
##     yedge             0.009    0.001   15.424    0.000
##     yedgex           -0.024    0.001  -30.265    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .letterA           0.024    0.000  100.000    0.000

3.2 Summarize and Interpret Outputs

Convergence: The model converged normally after just one iteration, indicating that the optimization algorithm had no trouble finding the parameter estimates.

Number of Observations: The model was fit to 20,000 observations.

Model Parameters: There are 17 model parameters.

Estimator: Maximum Likelihood (ML) was used as the estimator.

Model Test Statistic: The test statistic for the model is 0.000 with degrees of freedom equal to 0. This suggests that the model is saturated (perfect fit), as there are no degrees of freedom left.
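A rough accounting of why the degrees of freedom are zero, consistent with the 17 parameters reported above: the model estimates one slope per predictor plus a residual variance, which exactly matches the number of sample moments involving the outcome.

```r
# Degrees of freedom = sample moments involving letterA - free parameters
n_predictors   <- 16
sample_moments <- n_predictors + 1  # 16 covariances with letterA + its variance
free_params    <- n_predictors + 1  # 16 regression slopes + 1 residual variance
sample_moments - free_params        # 0 -> the model is saturated
```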

3.2.1 Parameter Estimates:

The parameter estimates section provides detailed information about how each observed variable relates to the outcome variable, letterA. Here’s a breakdown:

Each predictor variable (e.g., xbox, ybox, width) has an associated regression coefficient (Estimate), standard error (Std.Err), z-value, and p-value (P(>|z|)). The regression coefficient represents the change in letterA for a one-unit change in the predictor, holding the other predictors constant.

For instance, the predictor xbox has a coefficient of -0.023. This means that for every unit increase in xbox, the letterA variable decreases by 0.023 units, holding other variables constant. The negative sign indicates an inverse relationship.

The p-values for all predictors are 0.000, indicating that all predictors are statistically significant at any conventional alpha level (e.g., 0.05). This means that all predictors have a significant relationship with the outcome letterA.

3.2.2 Variances:

The residual variance of letterA is 0.024. This is the amount of variability in letterA that is not explained by the predictors in the model.

3.2.3 Insights:

Relationships: The variables xbox, onpix, ybar, x2bar, y2bar, xybar, x2ybar, xedge, xedgey, and yedgex have a negative relationship with letterA: as they increase, the probability of the letter being A decreases. In contrast, ybox, width, height, xbar, xy2bar, and yedge have a positive relationship with letterA.

Model Fit: Given that the degrees of freedom are 0, it suggests that the model is saturated. This means that the model perfectly fits the data. However, in practical scenarios, a saturated model may not always be desirable as it could lead to overfitting.

Significance: All the predictors in the model are statistically significant, which suggests that they all play a role in predicting the outcome variable.

Practical Interpretation: In the context of MRI or other imaging data, such relationships can help in understanding the contribution of individual predictors (voxel intensities or, here, letter-shape features) to an outcome (here, letterA). This can be crucial in diagnostic, predictive, or exploratory analyses.

## Fit GEE and GLMM with the Binary Outcome as Response

Here, we use the binary outcome letterA (the same response modeled in the SEM) as the response in GEE and GLMM models and compare their information criteria.

#install.packages(c("geepack","lme4"))
library(geepack)
gee_fit <- geeglm(letterA ~ xbox+ybox+width+height+onpix+xbar+ybar+x2bar+y2bar+xybar+x2ybar+xy2bar+xedge+xedgey+yedge+yedgex,
                 family = binomial, data = data, id = id, corstr = "exchangeable", scale.fix = TRUE)
summary(gee_fit)
## 
## Call:
## geeglm(formula = letterA ~ xbox + ybox + width + height + onpix + 
##     xbar + ybar + x2bar + y2bar + xybar + x2ybar + xy2bar + xedge + 
##     xedgey + yedge + yedgex, family = binomial, data = data, 
##     id = id, corstr = "exchangeable", scale.fix = TRUE)
## 
##  Coefficients:
##             Estimate  Std.err    Wald Pr(>|W|)    
## (Intercept)  6.32680  0.67455  87.972  < 2e-16 ***
## xbox        -0.79715  0.08162  95.377  < 2e-16 ***
## ybox         0.17278  0.04194  16.971 3.79e-05 ***
## width        0.95288  0.08780 117.776  < 2e-16 ***
## height       0.02650  0.06415   0.171  0.67951    
## onpix       -0.52285  0.06702  60.859 6.11e-15 ***
## xbar         0.57388  0.04850 140.000  < 2e-16 ***
## ybar        -0.35119  0.04969  49.952 1.58e-12 ***
## x2bar       -0.49737  0.05788  73.849  < 2e-16 ***
## y2bar       -1.02491  0.04241 584.047  < 2e-16 ***
## xybar       -0.62855  0.04711 178.032  < 2e-16 ***
## x2ybar      -0.38722  0.04083  89.939  < 2e-16 ***
## xy2bar       0.61889  0.04537 186.087  < 2e-16 ***
## xedge       -0.17581  0.06376   7.603  0.00583 ** 
## xedgey      -0.18058  0.07089   6.489  0.01085 *  
## yedge        0.23322  0.03854  36.610 1.44e-09 ***
## yedgex      -0.69729  0.05548 157.954  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Scale is fixed.
## 
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha        0       0
## Number of clusters:   20000  Maximum cluster size: 1
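One caveat on the correlation estimates above: since each id appears exactly once (maximum cluster size 1), there is no within-cluster correlation to estimate, the exchangeable alpha collapses to 0, and the GEE effectively reduces to an ordinary logistic GLM. A minimal check, mirroring the id construction above:

```r
# Every "cluster" is a singleton, so within-cluster correlation is undefined
id <- seq_len(20000)   # mirrors data$id created earlier
max(table(id))         # 1
```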
# Fit "GLMM": note that glm() does not understand lme4's (1 | id) random-effect
# syntax -- the term is evaluated as a logical vector and dropped as singular
# (see the NA coefficient row in the output below); a true GLMM would use
# lme4::glmer()
library(lme4)

glmm_fit <- glm(letterA ~ xbox + ybox + width + height + onpix + xbar + ybar + x2bar + y2bar + xybar + x2ybar + xy2bar + xedge + xedgey + yedge + yedgex+ (1 | id), data = data, family = binomial)
summary(glmm_fit)
## 
## Call:
## glm(formula = letterA ~ xbox + ybox + width + height + onpix + 
##     xbar + ybar + x2bar + y2bar + xybar + x2ybar + xy2bar + xedge + 
##     xedgey + yedge + yedgex + (1 | id), family = binomial, data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.585  -0.057  -0.019  -0.006   4.128  
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   6.3268     0.8249    7.67  1.7e-14 ***
## xbox         -0.7972     0.1143   -6.97  3.1e-12 ***
## ybox          0.1728     0.0519    3.33  0.00087 ***
## width         0.9529     0.1106    8.62  < 2e-16 ***
## height        0.0265     0.0761    0.35  0.72764    
## onpix        -0.5228     0.0816   -6.40  1.5e-10 ***
## xbar          0.5739     0.0590    9.72  < 2e-16 ***
## ybar         -0.3512     0.0561   -6.26  3.9e-10 ***
## x2bar        -0.4974     0.0512   -9.71  < 2e-16 ***
## y2bar        -1.0249     0.0487  -21.02  < 2e-16 ***
## xybar        -0.6285     0.0613  -10.25  < 2e-16 ***
## x2ybar       -0.3872     0.0488   -7.94  2.0e-15 ***
## xy2bar        0.6189     0.0623    9.94  < 2e-16 ***
## xedge        -0.1758     0.0594   -2.96  0.00310 ** 
## xedgey       -0.1806     0.0624   -2.90  0.00378 ** 
## yedge         0.2332     0.0404    5.78  7.6e-09 ***
## yedgex       -0.6973     0.0576  -12.10  < 2e-16 ***
## 1 | idTRUE        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6647.7  on 19999  degrees of freedom
## Residual deviance: 1392.7  on 19983  degrees of freedom
## AIC: 1427
## 
## Number of Fisher Scoring iterations: 10
# Compare information criteria. Note: the lavaan AIC is computed under a
# Gaussian likelihood, so it is not comparable to the binomial models below.
AIC(fit)
## [1] -17618
#install.packages("MuMIn")
library("MuMIn")
QIC(gee_fit)
##  QIC 
## 1412
AIC(glmm_fit)
## [1] 1427
# Collect the information criteria (AIC for the GLMM, QIC for the GEE)
aic_values <- c(GLMM=AIC(glmm_fit), GEE=QIC(gee_fit))

# Create a data frame
aic_df <- data.frame(Model = names(aic_values), 
                     AIC = as.numeric(aic_values))

# Transpose the data frame
transposed_aic_df <- as.data.frame(t(aic_df))

# Rename columns
colnames(transposed_aic_df) <- transposed_aic_df[1, ]
transposed_aic_df <- transposed_aic_df[-1, ]
colnames(transposed_aic_df)[colnames(transposed_aic_df) == "GEE.QIC"] <- "GEE"

# Print the transposed data frame
transposed_aic_df
##     GLMM  GEE
## AIC 1427 1412

The Akaike Information Criterion (AIC) is a measure used to compare different statistical models. It takes into account both the goodness of fit of the model and the complexity of the model. The key idea behind AIC is to penalize the addition of unnecessary parameters, which can lead to overfitting. A lower AIC value indicates a better model when comparing multiple models.

The GEE's QIC (1412) is lower than the GLMM's AIC (1427). QIC is the GEE analogue of AIC, so, between these two models, the GEE appears to fit the data better after penalizing for model complexity.
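As a cross-check, the GLMM's AIC can be reconstructed by hand from its output. For a Bernoulli model the saturated log-likelihood is 0, so the residual deviance equals -2·logLik and AIC = deviance + 2k:

```r
# AIC = -2*logLik + 2*k, with deviance = -2*logLik for a Bernoulli model
deviance <- 1392.7   # residual deviance reported in the glm summary
k <- 17              # estimated coefficients (intercept + 16 predictors)
deviance + 2 * k     # ~1426.7, matching the reported AIC of 1427
```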

4 LSTM Model

The Max Planck Institute in Jena, Germany, provides a comprehensive climate dataset covering 2009-2017, recorded every 10 minutes. This amounts to 473,111 observations over roughly 9 years, detailing attributes such as humidity, atmospheric pressure, and temperature. Our primary analysis revolves around predicting temperature trends, a crucial variable in understanding climate dynamics. We've partitioned the dataset, using the initial 320,000 observations for training and the rest for testing.
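A quick arithmetic check on the stated sample size: at one reading every 10 minutes, nine years of data comes out very close to the 473,111 rows in the file.

```r
# 10-minute sampling: 6 readings/hour * 24 hours * 365 days
obs_per_year <- 6 * 24 * 365
obs_per_year * 9   # 473040, close to the reported 473,111 observations
```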

library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
library(recipes)
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:Matrix':
## 
##     update
## The following object is masked from 'package:lavaan':
## 
##     update
## The following object is masked from 'package:stats':
## 
##     step
library(tibbletime)
## 
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
## 
##     filter
# Load the dataset
climate_data <- read_csv("I:/UBD PB/ZH5103 Predictive/Longitudinal/ClimateData_2009_2017_Jena_Germany_MaxPlanck.csv")
## Rows: 473111 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Date_Time
## dbl (14): p_mbar, T_degC, Tpot_K, Tdew_degC, rh_percent, VPmax_mbar, VPact_m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Consider only Date_Time and T_degC
climate_data <- climate_data %>% select(Date_Time, T_degC)
climate_data$Date_Time <- as.POSIXct(climate_data$Date_Time, format="%d.%m.%Y %H:%M:%S")

# Splitting the data
train_data <- climate_data[1:320000, ]
test_data <- climate_data[320001:nrow(climate_data), ]

features <- names(train_data)[-1] # Now, this will only select T_degC

# Normalization parameters from the training data (computed for reference;
# note that the sequences below are built from the raw T_degC values)
mean_train <- mean(train_data$T_degC)
sd_train <- sd(train_data$T_degC)

# Prepare Sequences
sequence_length <- 1

create_sequences <- function(data, sequence_length) {
  x <- list()
  y <- list()
  
  for(i in 1:(nrow(data) - sequence_length)) {
    x_end <- i + sequence_length - 1
    x_seq <- data[i:x_end, features]
    y_val <- data[x_end + 1, features]
    
    x[[length(x) + 1]] <- x_seq
    y[[length(y) + 1]] <- y_val
  }
  
  x <- array(unlist(x), dim = c(length(x), sequence_length, 1)) # Only one feature, T_degC
  y <- matrix(unlist(y), ncol = 1)
  list(x = x, y = y)
}

train_sequences <- create_sequences(train_data, sequence_length)
test_sequences <- create_sequences(test_data, sequence_length)

Using a sequence length of one time step (10 minutes of data) allows our Recurrent Neural Network (RNN) to capture short-term patterns efficiently. Training for 10 epochs balances model learning against computational cost and reduces the risk of overfitting.

#install.packages("remotes")
#remotes::install_github("rstudio/tensorflow", force = T)
#install.packages("reticulate")
library(reticulate)
use_python("C:/Users/Jun/AppData/Local/Programs/Python/Python38/python.exe")
library(tensorflow)
#tensorflow::install_tensorflow(version = "2.13.*")
#virtualenv_create("r-tensorflow")
#use_virtualenv("r-tensorflow", required = TRUE)
#install_tensorflow()

#devtools::install_github("rstudio/keras")
library(keras)
#install_keras()

# Define the LSTM model structure
model <- keras_model_sequential() %>%
  layer_lstm(units = 10, input_shape = c(sequence_length, 1), return_sequences = TRUE) %>%
  layer_lstm(units = 10, return_sequences = FALSE) %>%
  layer_dense(units = 1)

# Compile the model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # `lr` is deprecated
  loss = 'mae'
)

# Train the model
history <- model %>% fit(
  train_sequences$x, train_sequences$y,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.1,
  shuffle = FALSE
)
## Epoch 1/10
## 9000/9000 - 41s - loss: 1.5846 - val_loss: 0.3851 - 41s/epoch - 5ms/step
## Epoch 2/10
## 9000/9000 - 35s - loss: 0.3075 - val_loss: 0.2671 - 35s/epoch - 4ms/step
## Epoch 3/10
## 9000/9000 - 33s - loss: 0.2285 - val_loss: 0.1679 - 33s/epoch - 4ms/step
## Epoch 4/10
## 9000/9000 - 35s - loss: 0.2150 - val_loss: 0.2010 - 35s/epoch - 4ms/step
## Epoch 5/10
## 9000/9000 - 33s - loss: 0.2085 - val_loss: 0.1675 - 33s/epoch - 4ms/step
## Epoch 6/10
## 9000/9000 - 36s - loss: 0.2050 - val_loss: 0.1741 - 36s/epoch - 4ms/step
## Epoch 7/10
## 9000/9000 - 43s - loss: 0.2040 - val_loss: 0.1648 - 43s/epoch - 5ms/step
## Epoch 8/10
## 9000/9000 - 41s - loss: 0.2018 - val_loss: 0.1854 - 41s/epoch - 5ms/step
## Epoch 9/10
## 9000/9000 - 35s - loss: 0.2002 - val_loss: 0.1646 - 35s/epoch - 4ms/step
## Epoch 10/10
## 9000/9000 - 33s - loss: 0.1993 - val_loss: 0.1641 - 33s/epoch - 4ms/step
# Predictions
predictions <- model %>% predict(test_sequences$x)
## 4785/4785 - 8s - 8s/epoch - 2ms/step
# Calculate RMSE
rmse <- sqrt(mean((test_sequences$y - predictions)^2))
rmse
## [1] 0.2942
# Extracting date-time values for test data
test_dates <- climate_data$Date_Time[(320001 + sequence_length):(nrow(climate_data))]

# Adjusting plot_data dataframe
plot_data <- data.frame(Actual = test_sequences$y, Predicted = predictions)
plot_data$Date_Time <- test_dates[1:length(plot_data$Actual)]

# Plotting with ggplot2
ggplot(plot_data, aes(x = Date_Time)) + 
  geom_line(aes(y = Actual, color = "Actual")) + 
  geom_line(aes(y = Predicted, color = "Predicted")) + 
  labs(title = "Actual vs Predicted Temperatures", x = "Time", y = "Temperature") + 
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x labels for better visibility

The graph compares two time series: the actual temperatures (blue, "Actual") and the model's predictions (red, "Predicted").

The predicted values track the actual values closely, with substantial overlap and little deviation, suggesting the model has captured much of the underlying pattern in the data.

RMSE quantifies how much, on average, our predictions deviate from the actual observations. An RMSE of 0.2942 indicates that the model's predictions are off by approximately 0.29 units from the actual temperature values.

Since the temperature is measured in degrees Celsius, an error of approximately 0.3°C can be considered relatively small, indicating that the model is quite accurate.
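To make the RMSE interpretation concrete, here is a toy example with made-up temperature values, where every prediction is off by exactly 0.3:

```r
# RMSE aggregates per-observation errors (hypothetical values)
actual    <- c(10.0, 11.5, 9.8, 12.1)
predicted <- c(10.3, 11.2, 10.1, 11.8)
sqrt(mean((actual - predicted)^2))   # 0.3
```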

5 Text Generation

In our exploration of text generation, we implement a character-level LSTM. The training corpus, loaded below, consists of short work-injury narratives (as the generated samples later make clear). Given the extensive nature of deep learning model training, knitting an R Markdown document can be time-consuming. To expedite the knitting process, especially for report generation or format checking, we conditionally reduce the number of training epochs, training for only 10 epochs here. To further enhance efficiency, we employ the cache=TRUE option in R Markdown to store and reuse results, ensuring unchanged code is not re-computed. Given memory constraints, we also work with a sample of the data.

library(keras)
library(stringr)
## 
## Attaching package: 'stringr'
## The following object is masked from 'package:recipes':
## 
##     fixed
data.csv <- read.csv("https://umich.instructure.com/files/12554194/download?download_frd=1", header=T)
text <- tolower(data.csv$text)
#cat("Corpus length:", nchar(text), "\n")

maxlen <- 60
step <- 3

text <- paste(text, collapse = "")
text_indexes <- seq(1, nchar(text) - maxlen, by = step)

# Subsample the sequence start positions: one-hot encoding all ~4.4M
# sequences would exceed available memory
sample_size <- 100000  # or any feasible number less than 4,462,096
text_indexes <- sample(text_indexes, sample_size)

# This holds our extracted sequences
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)
# This holds the targets (the follow-up characters)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)
cat("Number of sequences: ", length(sentences), "\n")
## Number of sequences:  100000
# List of unique characters in the corpus
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
## Unique characters: 37
# Dictionary mapping unique characters to their index in `chars`
char_indices <- 1:length(chars) 
names(char_indices) <- chars
# Next, one-hot encode the characters into binary arrays.


cat("Vectorization...\n") 
## Vectorization...
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}

model <- keras_model_sequential() %>% 
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>% 
  layer_dense(units = length(chars), activation = "softmax")

optimizer <- optimizer_rmsprop(learning_rate = 0.01)  # `lr` is deprecated

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)   

sample_next_char <- function(preds, temperature = 1.0) {
  preds <- as.numeric(preds)
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  which.max(t(rmultinom(1, 1, preds)))
}
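To see what the temperature parameter in sample_next_char() does, the same reweighting can be applied to a toy next-character distribution: low temperatures sharpen it toward the most likely character, while a temperature of 1 leaves it unchanged.

```r
# Reweight a toy distribution at different temperatures (the same math as
# sample_next_char, without the multinomial draw)
reweight <- function(preds, temperature) {
  p <- exp(log(preds) / temperature)
  p / sum(p)
}
preds <- c(0.7, 0.2, 0.1)
round(reweight(preds, 0.2), 3)   # 0.998 0.002 0.000 -- nearly deterministic
round(reweight(preds, 1.0), 3)   # 0.700 0.200 0.100 -- unchanged
```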

#if (isTRUE(getOption("rmarkdown.pandoc.to", FALSE))) {
#  num_epochs <- 5  # reduced number of epochs for knitting
#} else {
#  num_epochs <- 50  # normal number of epochs for regular execution
#}

for (epoch in 1:10) {
  
  cat("Epoch", epoch, "\n")
  model %>% fit(x, y, batch_size = 64, epochs = 1) 
  
  start_index <- sample(1:(nchar(text) - maxlen - 1), 1)  
  seed_text <- str_sub(text, start_index, start_index + maxlen - 1)
  
  cat("--- Synth-Text Generation with seed:", seed_text, "\n\n")
  
  for (temperature in c(0.2, 0.5, 1.0)) {
    
    cat("----------- temperature:", temperature, "\n")
    cat(seed_text, "\n")
    
    generated_text <- seed_text
    
    for (i in 1:500) {
      
      sampled <- array(0, dim = c(1, maxlen, length(chars)))
      generated_chars <- strsplit(generated_text, "")[[1]]
      for (t in 1:length(generated_chars)) {
        char <- generated_chars[[t]]
        sampled[1, t, char_indices[[char]]] <- 1
      }
        
      preds <- model %>% predict(sampled, verbose = 0)
      next_index <- sample_next_char(preds[1,], temperature)
      next_char <- chars[[next_index]]
      
      generated_text <- paste0(generated_text, next_char)
      generated_text <- substring(generated_text, 2)
      
      cat(next_char)
    }
    cat("\n")
  }
}
## Epoch 1 
## 1563/1563 - 134s - loss: 1.7800 - 134s/epoch - 86ms/step
## --- Synth-Text Generation with seed: ed finger tip23 yo male unrestrained driver in mvc in garbag 
## 
## ----------- temperature: 0.2 
## ed finger tip23 yo male unrestrained driver in mvc in garbag 
## e boxes at work dx  contain at work and struck shoulder pain floor at work dx foot pain while at work dx  contain of head corneal at work                                                                                                                                                                                                                                                                                                                                                                           
## ----------- temperature: 0.5 
## ed finger tip23 yo male unrestrained driver in mvc in garbag 
## e fell on at work dx foot and pain by a boxey back pain  pain rash cornie on a boxes at work41yof at work  work back pain fell at work dx hand pain pain when ree fingers38yof                     dx efect at work dx lower at work dx elaceration lifting and lifting a the elores at work28yom was at work work bind the the head at work at work dx foot pain at work dx hand pain dx forearm dx hand pain38yom at work and was sitele burn dx   and fell on a come all dx a ran38yom walding while pain belt at
## ----------- temperature: 1 
## ed finger tip23 yo male unrestrained driver in mvc in garbag 
## r strucklee at work55yom espose pain as at work  dx leca lt foot strain29yof states comiar stal panee backspreasting head hand twas rorf on fell abrasion43yom dx at work dx ortoaburd removing twast paafto fell back dx states todient th urrwas back bworn contaiteysetts bearing exwork from lt foot 30 pat landing suptest a cats dx fert aslifting   dx 3gym knee elbowahing coofetied at work23yom idicle fell to peesoll dx fraceration34yom gomling work pat staopen on lect veal whist when on ering agdy 
## Epoch 2 
## 1563/1563 - 144s - loss: 1.3996 - 144s/epoch - 92ms/step
## --- Synth-Text Generation with seed: osing and it pinched his finger finger lac66yom works at air 
## 
## ----------- temperature: 0.2 
## osing and it pinched his finger finger lac66yom works at air 
## e was working a work at work dx laceration to lt for head injury to lt shoulder at work dx laceration to rt hand                                                                                                                                                                                                                                                                                                                                                                                                    
## ----------- temperature: 0.5 
## osing and it pinched his finger finger lac66yom works at air 
## b steps in stepped on a comptring back on a subject  work  dx lumbar strain44 yof c o body to rt shoulder strain to finger with finger pain   dx laceration w head dx slipped on a sted with mea fell on a car struck a compation up to finger strain54yom pt was was at work  s p was working on a cleaning down stepped to finger23yom was working sas body finger while at work dx down regor at work            low back pain53yom with to shoulder pain at work when in a contaninet states since stepped on a 
## ----------- temperature: 1 
## osing and it pinched his finger finger lac66yom works at air 
## s to l fanding on a molt 2 days agomer he work landed was lifting finger of war when cut h27 yom winh at work while wrist from steal newper eye  work lt atterking at 4umpur74 y o m low back pain stecking fewt crepping some metil pain  while and cut tith dx contaury tim tryied whiler heavy cut fx pincurm at work dbhor stepped sa3dy reshodent op low back pain35yo f twisted a states working down to der w was shepp38yof was durslage bela sind rt thumb cas door to arm at work29 yom lacerater ent he w
## Epoch 3 
## 1563/1563 - 1114s - loss: 1.3036 - 1114s/epoch - 713ms/step
## --- Synth-Text Generation with seed: was at work using a jack hammer now having low back pain  dx 
## 
## ----------- temperature: 0.2 
## was at work using a jack hammer now having low back pain  dx 
##  low back pain dx lac rt hand contusion lower back strain24 yo m c o low back pain after at work dx lac right sprain24 yo m c o rt shoulder pain after at work dx ankle sprain44 yo m c o back pain after bard at work dx contusion lower back pain24 yo m c o low back pain after at work dx contusion lower back at work dx sprain24 yo m c o low back pain after at work dx contusion  dx contusion sprain24 yom states at work and strained shoulder pain dx acute right the right index finger pain44 yo m c o 
## ----------- temperature: 0.5 
## was at work using a jack hammer now having low back pain  dx 
##  micer at work dx incteme head on finger lac concussion at work pt at work dx back pain48yof was punched a storained lifting a bar was at work dx purss dx lac   pt now pain39yom was cutting a shoulder pain at work dx acute right finger dx neck pain at work dx contusion lifting back last strain54yom at work using dight strained a strain  strained machiner landes sprain25yom was working in the lower back pain at work dx  works at c o lower back for ankle pain32 yof c o brist was at work dx finger 
## ----------- temperature: 1 
## was at work using a jack hammer now having low back pain  dx 
##  handation accheallste42 yo macermppa tw elormer work stopped at emped knee dx r eye arm6c yop naivele job in fder arm forearm was binst  hit madrial lobly with cutting landem than pt hourt to knee bckel dmal sprain21yof discal on lib doing wis hed went fx concussions a caudder28yom from in face concussion restrained iusesoalatign  dx accinstracicin 2calpting bracked lumber concrouse45yom 6 caught low curstruck bloce police work  to lawder pain26 yo m c o cut tirical sprain to back of whollwasho
## Epoch 4 
## 1563/1563 - 369s - loss: 1.2495 - 369s/epoch - 236ms/step
## --- Synth-Text Generation with seed: work dx puncture wound to left index finger31yom with pain t 
## 
## ----------- temperature: 0.2 
## work dx puncture wound to left index finger31yom with pain t 
## o r head on the head on the head on the head on the hand and strained shoulder dx laceration to finger dx shoulder  dx slippat fell of the head injury31yom with pain to r head on the head on the head on the right hip pain dx  hand  dx shoulder  dx finger dx shoulder pain dx  dx  laceration to finger dx finger contusion and struck head injury31yom with low back strain after rear and strained foot and fell at work dx  hand finger dx shoulder sprain25yom with pain to r head injury to rt foot and st
## ----------- temperature: 0.5 
## work dx puncture wound to left index finger31yom with pain t 
## o rt hand and sustained a patient to mid on the holder and fell the reports a piece of motor dx finger  work                                     dx  finger dx head inj all dx head injury3y y o m working to chest pain after lifting down avoutle down of her finger dx r a ceached shoulder  dx  ankle at work dx  shoulder pain30yom working at work when heavy lifting a pw to his eye at work dx  c o pain38 yom strained shoulder at work dx finger contusion to head injury  dx into wall spindler  s p pt n
## ----------- temperature: 1 
## work dx puncture wound to left index finger31yom with pain t 
## o r locse of 0r15yof sust pain x fraginer back on a grops injury dowk dx  dx fro2gh saw rowbards at 35th finger laceration41yom lifting on vece on naskd dur in cut right eye a to neal that shoured bur ridal chest 7xmbas wling tox3c c lbp loradialle felt soom a piecy cleang at work in theana38yom 2umen brian on a casoident fall into a has at work hearaadal dx  s s pt is by pk something thrt21 yof dut main of ope head injurywith foot a kige finger of sands 32am boghtoa ago while on a torighthen la
## Epoch 5 
## 1563/1563 - 383s - loss: 1.2129 - 383s/epoch - 245ms/step
## --- Synth-Text Generation with seed: y items working long hrs dx neuropathic pn47 yom dx accident 
## 
## ----------- temperature: 0.2 
## y items working long hrs dx neuropathic pn47 yom dx accident 
## ally at work and fell at work dx l knee pain dx laceration lifting a pallet dx laceration to lt hand pain                                                                                     dx lac finger lac contusion to lt hand pain dx laceration to lt foot dx laceration to lt hand at work dx laceration to finger pain44 yom c o low back pain after lifting a piece of metal at work dx laceration lt hand pain dx laceration low back pain43 yom states at work and fell at work dx hand pain dx lacerat
## ----------- temperature: 0.5 
## y items working long hrs dx neuropathic pn47 yom dx accident 
## ally while at work and sust up to his hand  pulling a patient  work spipped on the face her a patient  at work       contusion finger dx contusion low back pain36yom fell on l arm hip pt at work dx  dx lac finger fxfx dx l thumb             dx laceration to finger  dx lumbar strain29yof c o pain to lt hand and fell  work and works in the face cleaning a pallet by pal end on a popper of r a down of ankle  work dx leg    dx lumbar strain46yom with low back pain  felt a patient at work      hand wh
## ----------- temperature: 1 
## y items working long hrs dx neuropathic pn47 yom dx accident 
## ally at work drosing for ciece of ol f finger pain pain p abrasions 3 m 4pft contusion low arm pain lifting at work43yom has corneal abrasion finger with cardar on fa clifting albering heavy berts to finger popper cut himand compation08 yof strikee all thumb arm pain piece of hys working a floor29 yom s inco few offinger pain agogoty a foot phot splash at worklrimer to arm scalp on studnel rrectuce started dop to knee pain  leg pain arr5pom while moving prehibake dx  caught finger bit eyeper her
## Epoch 6 
## 1563/1563 - 494s - loss: 1.1893 - 494s/epoch - 316ms/step
## --- Synth-Text Generation with seed:   work  dx lt clavicle deformity46 yo male hurt using machin 
## 
## ----------- temperature: 0.2 
##   work  dx lt clavicle deformity46 yo male hurt using machin 
## e at work dx left shoulder dx legaling today dx laceration39yof at work and struck finger with a car decking on a compation dx contusion38yom with shoulder pain when a car decking his finger dx laceration27yom cut ring finger with a contusion to forearm a patient at work dx  ments was at work dx finger dx low back pain38yof sust l hand at work and fell on his ring finger with a contusion to hand at work dx finger dx shoulder dx finger strain44yom cut finger with a subject working as a contusion 
## ----------- temperature: 0.5 
##   work  dx lt clavicle deformity46 yo male hurt using machin 
## e on a car decking finger in a cerbitis of hand slicer  work dx finger sprain53yom laceration to lower back while at work dx chest strain38yof with at work trunk fell from a car radicusation to thumes24yof with finger at work and stepped a hand some     dx corneal abrasion38yof at work and fell on hand now c o while at work      dx left shoulder  dx arm lac46yof elbow pain in needle stick with a rot of water started hand at work          dx  dx finger contusion67yf working a legs knee and fell o
## ----------- temperature: 1 
##   work  dx lt clavicle deformity46 yo male hurt using machin 
## e fell at workchystele in a wound foread 9seren haile lac odheion54yom 15 ft of the wates ahom movic onfingeryix left thumbated dx fall needle struncture5h cmen pain bartcheratic sforewroth goulce streislies off lid while droinitis42 yom s p nep dropp  backs yest pvin cruthed x a wet forutof glove toy fill fx radma34 yom states 4well an acc contusion lt elbowprained inhalking went today injura24yof with emporection to stpntial hotmor foot puncturebhoud  work ht  dx knee contusionex at thr33yof w
## Epoch 7 
## 1563/1563 - 388s - loss: 1.1701 - 388s/epoch - 248ms/step
## --- Synth-Text Generation with seed: ll chi43yom putting handcuffs on a person and hurt l knee  k 
## 
## ----------- temperature: 0.2 
## ll chi43yom putting handcuffs on a person and hurt l knee  k 
## nee pain  dx lac to lower back pain44yom with lac to lower back pain at work                                               dx low back pain30yom with shoulder pain at work                                                                                                                                                                                                                                                                                                                                         
## ----------- temperature: 0.5 
## ll chi43yom putting handcuffs on a person and hurt l knee  k 
## nee abrdpating work at work dx  strain  strain fremoving while at work            head injury to chest pain                                         sprain  fing to head injury finger lac dx finger sprain30yom with fell off a resident at work dx upper arm pain fell at work            dx shoulder pain dx surn today c o low back pain30yom was working in an eye pain after a lifting heavy lifting heavy object at work dx lower back pain42yof was working with ankle at work    lac heavy object for work 
## ----------- temperature: 1 
## ll chi43yom putting handcuffs on a person and hurt l knee  k 
## nee shasrauthiunt was shoeling truck at work sprain her shoulder fell0grine at work56yom at work on subject  muscle surpaured head injury was coworker2x yom was exsost opudreved the boxes at work pt fell mided finger this am dx locutal connt it shelp  bit 241yom with of one bart while working sust right hand on theah andleasion22yf cleaning heavy obj plir and ankle       sprain40 ydm puncted employee tripped at work  o eir when she twisted ankle fell to cleaning at   employee at work      bc leg
## Epoch 8 
## 1563/1563 - 1754s - loss: 1.1508 - 1754s/epoch - 1s/step
## --- Synth-Text Generation with seed: k when a matt ress was set on fire pt co at jail dx smoke in 
## 
## ----------- temperature: 0.2 
## k when a matt ress was set on fire pt co at jail dx smoke in 
##  the face corneal abrasion  strained lower back pain48yom at work at work and felt a pain after a car at work                                                                                                                                                                                                                                                                                                                                                                                                       
## ----------- temperature: 0.5 
## k when a matt ress was set on fire pt co at jail dx smoke in 
## struck or finger dx lumbar strain45 yom injurihitt between a blade at work and felt a patient at work dx left knees  dx lumbar strain44 yo m c o low back pain w low back pain after mital splashed face dx strain48yom at work at work and felt a patient finger pain dx finger sprain31yom at work at work dx veining over a scind at work dx lumbar strain34yom pain to lt knee at work stanting a strain at work         dx laceration to lt thumb laceration56 yom at work at work and felt pain s p pain after
## ----------- temperature: 1 
## k when a matt ress was set on fire pt co at jail dx smoke in 
##  the thru mak ahes pain bin metal forkingxpacticam  work ps insafeted sweepey lked cut arm slipped off lacer weing of the pts handlower back pain2 of lifting ceme glene moving left forearm at work dx laceration  to lt forearm fingers to inleffg  work35yom feacr buting xome arrest cut right eye48yof sacked states at parea boxes it finger pain into work26yf c o l foot cleanic in the 5 days ago today pt s p tripped over knee with a heavy boxes odf of lifting and struck his thront dehours27 yom c o 
## Epoch 9 
## 1563/1563 - 419s - loss: 1.1347 - 419s/epoch - 268ms/step
## --- Synth-Text Generation with seed: ell on foot injurint toe  at work dx fx right great toe36yom 
## 
## ----------- temperature: 0.2 
## ell on foot injurint toe  at work dx fx right great toe36yom 
##  c o lt foot at work and fell from steel thrown at work dx shoulder  dx low back pain dx low back pain dx lumbar strain24yom c o lt lower back pain after lifting a boxes was at work dx low back pain dx shoulder pain dx lumbar strain44yof at work at work at at work and fell on his lower back pain dx lumbar strain24yof c o low back pain  s p pt states at work and fell off a standing on a car dropped at work dx sprain right ankle dx lumbar strain25yof at work at work at work when a car arm pain whe
## ----------- temperature: 0.5 
## ell on foot injurint toe  at work dx fx right great toe36yom 
##  was at work at work on a box sushing of farm work w shoulder pain sprain of shoulder44yom cart heavy cellt a resident when a patient of right on the boxes at work dx at shoulder dx  r laceration lower back pain55yof sus lac to lower arm at work arm pain  dx contusion of machine floor  strained finger22yf acc cut finger with a pain when a pipe strained eye pain44yom was at work posea carts to low back pain               dx lumbar strain21yof c o low back pain after lifting at work when back pain
## ----------- temperature: 1 
## ell on foot injurint toe  at work dx fx right great toe36yom 
##  at work securitive fing tractorh rt right eye bing o rt foot55yof fell co rt ground forearm propsed by dooak of the by it to the knee24yom dx a 2wk  broke pain and hit rt eyep into ressiclic ing  l op to with box at a craucrule to surrcifas bro5tyom painln affispes at  dx shoulder  pain pain facy at work dx pana at sandery at ptsing nasked ieseh belt standing24 yof stecked standing boxes when steelbjgy 39yom cut a led thru f lower a s4yof slipped over  of c pal saw outd iffilured bh gutt nail a
## Epoch 10 
## 1563/1563 - 2498s - loss: 1.1216 - 2498s/epoch - 2s/step
## --- Synth-Text Generation with seed:  at work  hand lac w tendon involvement52yom sust contusion  
## 
## ----------- temperature: 0.2 
##  at work  hand lac w tendon involvement52yom sust contusion  
## to rt thumb while working with a patient  work                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## ----------- temperature: 0.5 
##  at work  hand lac w tendon involvement52yom sust contusion  
## to hand on at work                                           dx lower back pain    dx shoulder pain leg dx finger lac24 yom dx contusion to lower back while working on a heavy lifting his of the the face work with a pallett while working dx finger dx acute finger lac37 yom pain to lower back pain to r hand in a car work while working dx back pain dx  finger sprain23yom was carrying a patient in the and fell on his foot at work and dropped back pain bust for a bus lift tipped and at work finger25
## ----------- temperature: 1 
##  at work  hand lac w tendon involvement52yom sust contusion  
## to concrete  dx laceration sirenoly box slipped his neck low back px injury28yom denewordes laceration with a strain his ce pt with sheet fall throwroat s is a fallings4oom honthers work pt w needreane spi was struck pull neck midder dx pt starts safole bulling struck head hand struck armmousiated foot an eftraumacic dx finger work lastaescic22 y m arm inte groin pain after grinder with shoulder pain finger33 yof c o a got cowant kis mohand using heavy 9face bulling fell hist2ft 3ch higs heavy b

The output above reports synthetic text generated at temperature=0.2, temperature=0.5, and temperature=1.0. Lower temperatures tend to yield highly repetitive, predictable text that preserves realistic English-language structure. Higher temperatures produce more varied and elaborate synthetic text, but the model may invent plausible-looking words, phrases, or tokens that break the local sentence structure, or emit semi-random strings of characters.
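To make the role of the temperature parameter concrete, here is a minimal, self-contained sketch (in Python, for illustration only; the function names are our own) of how temperature rescales a model's output logits before sampling the next character. Dividing the logits by the temperature and re-normalizing through a softmax sharpens the distribution when temperature < 1 (favoring the most likely character, hence the repetitive output) and flattens it when temperature > 1 (allowing rare, sometimes nonsensical, characters through).

```python
import math
import random

def temperature_probs(logits, temperature=1.0):
    """Rescale logits by temperature and return softmax probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_index(logits, temperature=1.0, rng=random):
    """Draw one character index from the temperature-adjusted distribution."""
    probs = temperature_probs(logits, temperature)
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1                 # guard against floating-point round-off

# Example: three-character "vocabulary" with raw logits [1.0, 5.0, 2.0]
low_t = temperature_probs([1.0, 5.0, 2.0], temperature=0.1)   # nearly one-hot
high_t = temperature_probs([1.0, 5.0, 2.0], temperature=10.0) # close to uniform
```

At temperature 0.1 essentially all probability mass sits on the highest-logit character, which mirrors the repetitive temperature=0.2 samples above; at temperature 10 the distribution is nearly uniform, which mirrors the semi-random character strings seen at temperature=1.0.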